Sixt Data Science Lab - Test Task for Data Scientist Job Candidates¶
Introduction¶
In this test task you will have the opportunity to demonstrate your skills as a Data Scientist from various angles - processing data, analyzing and visualizing it, finding insights, applying predictive techniques, and explaining your reasoning.
The task is based around a bike sharing dataset openly available at UCI Machine Learning Repository [1].
Please go through the steps below, build up the necessary code and comment on your choices.
Part 1 - Data Loading and Environment Preparation¶
Tasks:
- Prepare a Python 3 virtual environment (with the virtualenv command). A requirements.txt produced by the pip freeze command should be included as part of your submission.
- Load the data from the UCI Repository and put it into the same folder as the notebook. The link to it is https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset . Here is an available mirror in case the above website is down: https://data.world/uci/bike-sharing-dataset
- Split the data into two parts: one dataset containing the last 30 days and one dataset with the rest.
import pandas as pd
import numpy as np
# read raw data
df_all = pd.read_csv('day.csv')
# split dataset: hold out the last 30 days for testing
df_last30 = df_all.tail(30).copy()
df = df_all.iloc[:-30, :].copy()  # .copy() avoids SettingWithCopyWarning on later assignments
df.head()
| | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
| 1 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
| 2 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
| 3 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
| 4 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.186900 | 82 | 1518 | 1600 |
Part 2 - Data Processing and Analysis¶
Tasks:
- Perform all needed steps to load and clean the data. Please comment the major steps of your code.
- Visualise rentals of bikes per day.
- Assume that each bike has a maximum of exactly 12 rentals per day.
  - Find the maximum number of bicycles `nmax` that was needed in any one day.
  - Find the 95%-percentile of bicycles `n95` that was needed in any one day.
- Visualize the distribution of the covered days depending on the number of available bicycles (e.g. `nmax` bicycles would cover 100% of days, `n95` covers 95%, etc.)
Part 2 - Task 1 - Perform all needed steps to load and clean the data. Please comment on the major steps of your code.¶
Inspect the Data¶
# Check the shape of the dataset (rows, columns)
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")
# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())
# Check the data types of each column
print("\nData types of each column:")
print(df.dtypes)
df.describe()
Dataset contains 701 rows and 16 columns.
Missing values in each column:
instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

Data types of each column:
instant                int64
dteday                object
season                 int64
yr                     int64
mnth                   int64
holiday                int64
weekday                int64
workingday             int64
weathersit             int64
temp                 float64
atemp                float64
hum                  float64
windspeed            float64
casual                 int64
registered             int64
cnt                    int64
dtype: object
| | instant | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 |
| mean | 351.000000 | 2.479315 | 0.479315 | 6.285307 | 0.028531 | 3.004280 | 0.684736 | 1.385164 | 0.502732 | 0.480847 | 0.625717 | 0.190534 | 866.937233 | 3661.104137 | 4528.041369 |
| std | 202.505555 | 1.090839 | 0.499929 | 3.329294 | 0.166602 | 2.003207 | 0.464953 | 0.542489 | 0.182781 | 0.162584 | 0.141988 | 0.076740 | 693.470674 | 1553.467783 | 1939.766889 |
| min | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.059130 | 0.079070 | 0.000000 | 0.022392 | 2.000000 | 20.000000 | 22.000000 |
| 25% | 176.000000 | 2.000000 | 0.000000 | 3.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.343478 | 0.348470 | 0.519167 | 0.134958 | 317.000000 | 2507.000000 | 3194.000000 |
| 50% | 351.000000 | 2.000000 | 0.000000 | 6.000000 | 0.000000 | 3.000000 | 1.000000 | 1.000000 | 0.514167 | 0.503146 | 0.623750 | 0.182221 | 738.000000 | 3656.000000 | 4541.000000 |
| 75% | 526.000000 | 3.000000 | 1.000000 | 9.000000 | 0.000000 | 5.000000 | 1.000000 | 2.000000 | 0.660000 | 0.613025 | 0.728750 | 0.233221 | 1135.000000 | 4739.000000 | 6041.000000 |
| max | 701.000000 | 4.000000 | 1.000000 | 12.000000 | 1.000000 | 6.000000 | 1.000000 | 3.000000 | 0.861667 | 0.840896 | 0.972500 | 0.507463 | 3410.000000 | 6946.000000 | 8714.000000 |
Handle Missing Values¶
# Since there are no missing values, no further action is needed here
# If there were missing values, we could fill them or drop the rows
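For reference, a minimal sketch of the usual options, shown on a toy frame with a hypothetical gap in hum (the real dataset has none):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for df, with a hypothetical gap in 'hum'
toy = pd.DataFrame({'hum': [0.5, np.nan, 0.7], 'cnt': [985, 801, 1349]})

# Option 1: drop rows containing missing values
dropped = toy.dropna()

# Option 2: impute a numeric column with its median
filled = toy.assign(hum=toy['hum'].fillna(toy['hum'].median()))

# Option 3: interpolate, which is natural for an ordered daily series
interpolated = toy.assign(hum=toy['hum'].interpolate())

print(len(dropped), round(filled['hum'].iloc[1], 2), round(interpolated['hum'].iloc[1], 2))
```

For a time series like this one, interpolation or forward-filling usually preserves the daily structure better than dropping rows.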
Convert Data Types¶
# Convert 'dteday' to datetime
df['date'] = pd.to_datetime(df['dteday'])  # vectorized; no need for .apply
df = df.drop('dteday', axis=1)
# Confirm the conversion
print("\nData type of 'dteday' after conversion:")
print(df['date'].dtype)
Data type of 'dteday' after conversion: datetime64[ns]
Remove Unnecessary Columns¶
# Drop 'instant' as the column is just a record index and carries no information.
df = df.drop(columns=['instant'])
# Check the dataset after dropping the column
print("\nColumns after dropping 'instant':")
print(df.columns)
Columns after dropping 'instant':
Index(['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual',
'registered', 'cnt', 'date'],
dtype='object')
Handle Categorical Variables¶
# Convert 'season', 'weathersit', 'weekday', and 'mnth' to categorical data types
categorical_columns = ['season', 'weathersit', 'weekday', 'mnth', 'holiday', 'workingday', 'yr']
for col in categorical_columns:
df[col] = df[col].astype('category')
# Confirm the conversion
print("\nData types after conversion to categorical:")
print(df.dtypes)
Data types after conversion to categorical:
season            category
yr                category
mnth              category
holiday           category
weekday           category
workingday        category
weathersit        category
temp               float64
atemp              float64
hum                float64
windspeed          float64
casual               int64
registered           int64
cnt                  int64
date        datetime64[ns]
dtype: object
Normalize or Scale Data¶
# temp, atemp, hum and windspeed are already stored normalized, so no further
# scaling is needed; tree-based models (used below) are scale-invariant anyway.
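If human-readable units are ever needed for reporting, the UCI codebook describes how the columns were scaled: temp and atemp were divided by their maxima (41 and 50 degrees Celsius), hum by 100, and windspeed by 67. The divisors below are taken from that description; treat them as assumptions if your codebook version differs.

```python
# Recover human-readable units from the normalized columns
# (divisors per the UCI codebook; assumed, not re-derived from the data)
row = {'temp': 0.344167, 'hum': 0.805833, 'windspeed': 0.160446}  # first row of day.csv

celsius = row['temp'] * 41        # back to degrees Celsius
humidity_pct = row['hum'] * 100   # back to percent
wind_kmh = row['windspeed'] * 67  # back to km/h

print(round(celsius, 1), round(humidity_pct, 1), round(wind_kmh, 1))
```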
Final Inspection¶
# The first five rows
df.head()
# Summary of the dataset
df.describe(include='all')
| | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 701.0 | 701.0 | 701.0 | 701.0 | 701.0 | 701.0 | 701.0 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701.000000 | 701 |
| unique | 4.0 | 2.0 | 12.0 | 2.0 | 7.0 | 2.0 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | 3.0 | 0.0 | 1.0 | 0.0 | 6.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 188.0 | 365.0 | 62.0 | 681.0 | 101.0 | 480.0 | 451.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.502732 | 0.480847 | 0.625717 | 0.190534 | 866.937233 | 3661.104137 | 4528.041369 | 2011-12-17 00:00:00 |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.059130 | 0.079070 | 0.000000 | 0.022392 | 2.000000 | 20.000000 | 22.000000 | 2011-01-01 00:00:00 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.343478 | 0.348470 | 0.519167 | 0.134958 | 317.000000 | 2507.000000 | 3194.000000 | 2011-06-25 00:00:00 |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.514167 | 0.503146 | 0.623750 | 0.182221 | 738.000000 | 3656.000000 | 4541.000000 | 2011-12-17 00:00:00 |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.660000 | 0.613025 | 0.728750 | 0.233221 | 1135.000000 | 4739.000000 | 6041.000000 | 2012-06-09 00:00:00 |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.861667 | 0.840896 | 0.972500 | 0.507463 | 3410.000000 | 6946.000000 | 8714.000000 | 2012-12-01 00:00:00 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.182781 | 0.162584 | 0.141988 | 0.076740 | 693.470674 | 1553.467783 | 1939.766889 | NaN |
Part 2 - Task 2 Visualise rentals of bikes per day.¶
df.set_index('date', inplace=True)
daily_rentals = df.cnt
import matplotlib.pyplot as plt
# Plot
plt.figure(figsize=(12, 6))
daily_rentals.plot(kind='line', color='g', linestyle='-', marker='o')
plt.title('Bike Rentals Per Day')
plt.xlabel('Date')
plt.ylabel('Number of Rentals')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
df.reset_index('date', inplace=True)
# df
# !pip install plotly_calplot
# Heatmap of total number of ride
from plotly_calplot import calplot
# creating the plot
fig = calplot(
df,
x='date',
y='cnt'
)
fig.show()
Analysis of the Bike Rentals Per Day Chart¶
The charts show a clear annual trend in bike rentals. The number of rentals peaks in the summer months (June, July, and August) and is lowest in the winter months (December, January, and February). This seasonal pattern is likely due to favorable weather conditions during the summer months, making cycling more appealing.
Specific Observations:
- Year-over-Year Comparison: There appears to be a slight increase in overall rentals from 2011 to 2012. This could be attributed to factors such as increased bike-sharing infrastructure, the growing popularity of cycling as a mode of transportation, or changes in weather patterns.
- Seasonal Fluctuations: The seasonal peaks and troughs in rentals are consistent across both years. This indicates a strong correlation between weather and bike rental behavior.
- Weekly Patterns: The heatmap chart shows some weekly patterns in bike rentals, with higher demand on weekdays and lower demand on weekends, particularly on Sundays.
- Outliers: A few data points deviate significantly from the overall trend. These could be because of extreme weather events, special events, or data collection errors.
Potential Factors Influencing Bike Rentals:
- Weather: Temperature, precipitation, and daylight hours all influence bike rental demand.
- Seasonality: Spring and summer are generally more favorable for cycling.
- Special Events: Major events or holidays can increase or decrease bike rentals.
- Infrastructure: The availability of bike lanes, parking facilities, and bike-friendly routes can impact rental numbers.
- Economic Conditions: Economic factors such as unemployment rates and fuel prices can influence people's choice of transportation.
Suggestions:
To gain a deeper understanding of bike rental trends, additional analysis could be conducted, such as:
- Correlating rentals with weather data: This would help quantify the impact of weather on bike usage.
- Examining weekly and daily patterns: Analyzing rentals on a finer time scale can reveal more detailed insights.
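As a sketch of the second suggestion: with a datetime index, pandas makes weekly and weekday aggregation one-liners. Shown here on a toy series standing in for the daily cnt values.

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for df['cnt'] with a datetime index
idx = pd.date_range('2011-01-01', periods=28, freq='D')
rentals = pd.Series(np.arange(28) * 100 + 1000, index=idx)

# Average rentals per calendar week (default 'W' bins end on Sunday)
weekly_mean = rentals.resample('W').mean()

# Average rentals per day of week (0=Monday ... 6=Sunday)
by_weekday = rentals.groupby(rentals.index.dayofweek).mean()

print(len(weekly_mean), by_weekday.idxmax())
```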
Part 2 - Task 3. Assume that each bike has a maximum of exactly 12 rentals per day.¶
* Find the maximum number of bicycles `nmax` that was needed in any one day.
* Find the 95%-percentile of bicycles `n95` that was needed in any one day.
# Maximum rentals per day per bike
max_rentals_per_bike = 12
# Calculate the number of bikes needed each day
bikes_needed = np.ceil(daily_rentals / max_rentals_per_bike).astype(int)
df['bikes_needed'] = list(bikes_needed)
# Calculate the maximum number of bikes needed on any one day, which indicates the peak demand for bicycles on any single day.
nmax = bikes_needed.max()
# Calculate the 95th percentile of the number of bikes needed
# This value indicates that 95% of the days had a demand for bicycles less than or equal to this number.
n95 = int(np.percentile(bikes_needed, 95))
print(f"Maximum number of bicycles needed on any one day: {nmax}")
print(f"95th percentile of bicycles needed on any one day: {n95}")
# df
Maximum number of bicycles needed on any one day: 727
95th percentile of bicycles needed on any one day: 632
Part 2 - Task 4 - Visualize the distribution of the covered days depending on the number of available bicycles (e.g. nmax bicycles would cover 100% of days, n95 covers 95%, etc.)¶
# Define bin edges (e.g., bins of size 5)
bin_edges = np.arange(0, bikes_needed.max() + 10, step=5) # Adjust step size as needed
# Calculate the percentage of days covered by each number of bikes
coverage_percentages = []
for edge in bin_edges:
    coverage_percentage = (bikes_needed.values <= edge).mean() * 100
    coverage_percentages.append(coverage_percentage)
# Plot
plt.figure(figsize=(12, 6))
plt.plot(bin_edges, coverage_percentages, color='b')
plt.title('Coverage of Days Depending on Number of Available Bicycles (Binned)')
plt.xlabel('Number of Bicycles')
plt.ylabel('Percentage of Days Covered')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
Analysis of the Chart: Coverage of Days Depending on Number of Available Bicycles¶
The chart displays a clear upward trend, indicating that as the number of available bicycles increases, the percentage of days with sufficient coverage also increases. This reveals the existence of a positive correlation between the availability of bicycles and the ability to fulfill the demand.
Specific Observations:
- Initial Plateau: The graph starts with a relatively flat section, suggesting that even with a small number of bicycles, a certain percentage of days can be covered. This might be due to factors like low demand on some days or the ability to redistribute bicycles from areas with extra supply.
- Steadily Increasing Coverage: As the number of bicycles increases, the coverage rate rises steadily.
- Asymptotic Behavior: The curve seems to approach a maximum value, suggesting that there might be a point of diminishing returns where adding more bicycles doesn't significantly increase coverage. This could be due to factors like congestion of demand or limitations in infrastructure or operational capacity (If the bike supply data was available, we could provide more insight into the market capacity).
Potential Implications:
- Optimal Bicycle Inventory: The chart can help determine the optimal number of bicycles to maintain based on desired coverage levels. By identifying the point where the curve starts to plateau, organizations can assess whether the additional cost of more bicycles justifies the increase in coverage.
- Demand Forecasting: The data can be used to forecast future demand and adjust bicycle inventory accordingly. If demand is expected to increase, organizations can proactively add more bicycles to maintain coverage.
- Infrastructure Planning: The chart can inform decisions about expanding bike-sharing infrastructure, such as adding more docking stations or increasing the fleet size. By understanding the relationship between bicycle availability and coverage, organizations can make data-driven decisions about resource allocation.
Further Analysis:
To gain a deeper understanding of the relationship between bicycle availability and coverage, additional analysis could be conducted, such as:
- Seasonal Variations: Examining the data for seasonal patterns can help identify periods of higher or lower demand.
- Geographic Factors: Analyzing coverage by location can reveal variations in demand across different areas.
- Weather Impacts: Investigating the relationship between weather conditions and bicycle usage can help identify factors that influence demand.
- Competitive Analysis: Comparing the data with information about competing bike-sharing services can provide insights into market dynamics.
Further insights¶
Correlation table¶
import seaborn as sns
cor = df.corr()
plt.figure(figsize=(12, 12))
sns.heatmap(cor, cmap=plt.cm.CMRmap_r,annot=True)
plt.show()
cor['cnt']
date            0.706355
season          0.389125
yr              0.603446
mnth            0.323326
holiday        -0.058080
weekday         0.065928
workingday      0.048162
weathersit     -0.288684
temp            0.631649
atemp           0.633885
hum            -0.101409
windspeed      -0.234871
casual          0.680335
registered      0.944966
cnt             1.000000
bikes_needed    0.999999
Name: cnt, dtype: float64
Analysing the linear correlation heatmap diagram¶
Strong Positive Correlations
- Temp (0.631649) and Atemp (0.633885): Both actual temperature and "feels-like" temperature have a strong positive correlation with the rental count, indicating that warmer temperatures lead to higher bike usage.
- Year (0.603446): A positive correlation with the year implies an increasing trend in bike rentals year over year, likely due to the service growing in popularity or geographic expansion (As we have only two values for the year column, the correlation is not significant and reliable).
Weak Positive Correlations
- Weekday (0.065928) and Working Day (0.048162): The very weak positive correlations suggest that the day of the week or whether it is a working day has little impact on the total rental count. Rentals might be fairly consistent throughout the week.
Negative Correlations
- Holiday (-0.058080): This very weak negative correlation suggests that rentals slightly decrease on holidays, but the effect is minimal.
- Humidity (-0.101409): A weak negative correlation with humidity suggests that higher humidity slightly discourages bike rentals, but the effect is small.
Moderate Negative Correlations
- Windspeed (-0.234871): A moderate negative correlation with wind speed indicates that stronger winds are associated with fewer bike rentals, which again is understandable as windy conditions can make biking less enjoyable or more difficult.
- Weather Situation (-0.288684): A moderate negative correlation with weather conditions indicates that worse weather (like rain or snow) leads to fewer bike rentals, which is intuitive.
Summary
- Most Influential Factors: temperature (both temp and atemp).
- Negative Influences: adverse weather conditions, higher wind speeds, and, to a lesser extent, higher humidity reduce the number of rentals.
- Minimal Impact: holidays and whether it is a working day have little effect on the overall rental count.
Part 3 - Building prediction models¶
Tasks:
- Define a test metric for predicting the daily demand for bike sharing, which you would like to use to measure the accuracy of the constructed models, and explain your choice.
- Build a demand prediction model with Random Forest, preferably making use of following python libraries: scikit-learn.
- Report the value of the chosen test metric on the provided data.
Part 3- Task 1- Define a test metric for predicting the daily demand for bike sharing, which you would like to use to measure the accuracy of the constructed models, and explain your choice.¶
Test Metric for Predicting Daily Bike-Sharing Demand
To measure the accuracy of prediction models for daily bike-sharing demand, the Root Mean Squared Error (RMSE) is a suitable test metric. Here's why RMSE is a strong choice:
Definition of RMSE The Root Mean Squared Error (RMSE) is defined as:
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
where:
- $n$ is the number of observations (days in this context),
- $y_i$ is the actual demand (i.e., the actual count of bike rentals on day $i$),
- $\hat{y}_i$ is the predicted demand by the model for day $i$.
Why RMSE? RMSE is chosen as the test metric for predicting daily bike-sharing demand because it effectively balances the need to penalize large errors, provides an intuitive and interpretable measure of accuracy, and is a widely accepted standard in regression analysis. By minimizing RMSE, we aim to develop a model that predicts bike demand as accurately as possible, ensuring efficient and effective management of the bike-sharing service.
Comparison to Other Metrics
Mean Absolute Error (MAE): MAE is another common metric but is less sensitive to outliers than RMSE. When large errors should be penalized more heavily (as in this case), RMSE is more appropriate.
Mean Squared Error (MSE): While MSE is the precursor to RMSE, its value is in squared units of the target variable, making it harder to interpret. RMSE resolves this by taking the square root.
R-squared ($R^2$): $R^2$ provides a relative measure of fit, indicating the proportion of variance explained by the model. While useful for understanding overall model performance, it doesn't provide the same level of direct, interpretable insight into the magnitude of prediction errors as RMSE does.
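As a small, self-contained illustration of why RMSE reacts to large errors more strongly than MAE (toy numbers, not values from the dataset):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error, as defined above."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy demand vs. prediction: three 100-unit errors and one 1000-unit miss
y_true_toy = np.array([1000, 2000, 3000, 4000])
y_pred_toy = np.array([1100, 1900, 3100, 3000])

mae = float(np.mean(np.abs(y_true_toy - y_pred_toy)))
print(f"MAE  = {mae:.1f}")                           # all errors weighted linearly
print(f"RMSE = {rmse(y_true_toy, y_pred_toy):.1f}")  # dominated by the big miss
```

With identical inputs, MAE comes out at 325 while RMSE exceeds 500: the single large miss dominates the squared term, which is exactly the sensitivity we want when under-supplying bikes on a peak day is costly.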
Part 3- Task 2- Build a demand prediction model with Random Forest, preferably making use of following python libraries: scikit-learn.¶
df.columns
Index(['date', 'season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday',
'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual',
'registered', 'cnt', 'bikes_needed'],
dtype='object')
# Non-numerical columns like 'date' must be removed for modeling.
# 'casual', 'registered' and 'bikes_needed' are derived from the target 'cnt'
# (or the target from them), so keeping them would leak it; they are removed.
to_drop = ['date', 'casual', 'registered', 'bikes_needed']
df.drop(to_drop, axis=1, inplace=True)
df.head()
| | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 985 |
| 1 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 801 |
| 2 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 1349 |
| 3 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.200000 | 0.212122 | 0.590435 | 0.160296 | 1562 |
| 4 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.229270 | 0.436957 | 0.186900 | 1600 |
Fitting the RF model¶
# A simple Random forest
from sklearn.ensemble import RandomForestRegressor
X, y = df.iloc[:,:-1], df.iloc[:,-1]
RFreg_simple = RandomForestRegressor()
RFreg_simple.fit(X, y)
RandomForestRegressor()
# The columns that were dropped for training must also be dropped from the test set
df_test = df_last30.drop(['instant', 'dteday', 'casual', 'registered'], axis=1)
Part 3- Task 3- Report the value of the chosen test metric on the provided data¶
Evaluating the model¶
# Evaluation for the simple random forest
from sklearn.metrics import (root_mean_squared_error, mean_squared_error,
                             mean_absolute_error, r2_score)
X_test , y_test = df_test.iloc[:,:-1], df_test.iloc[:,-1]
simpleRF_prediction = RFreg_simple.predict(X_test )
print('RMSE_train:', root_mean_squared_error(y, RFreg_simple.predict(X)))
print('RMSE_test:', root_mean_squared_error(y_test, simpleRF_prediction))
print('MSE of test:', mean_squared_error(y_test, simpleRF_prediction))
print('MAE of test:', mean_absolute_error(y_test, simpleRF_prediction))
print("Test R^2:", r2_score(y_test, simpleRF_prediction))
RMSE_train: 242.69291532026153
RMSE_test: 1143.3229617304114
MSE of test: 1307187.3948199996
MAE of test: 895.3226666666666
Test R^2: 0.5917602875761707
The test RMSE is high relative to the average daily count (~4,500 rentals) and much larger than the training RMSE, which points to overfitting. Tuning the regressor may reduce the error.
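To judge whether an RMSE around 1,100 is actually high, it helps to compare it against a trivial baseline that predicts the training mean for every test day; any useful model should beat this. A sketch with toy numbers standing in for y and y_test:

```python
import numpy as np

# Naive baseline: predict the training mean for every test day.
# Toy numbers stand in for y (training target) and y_test.
y_train_toy = np.array([3000.0, 4000.0, 5000.0, 6000.0])
y_test_toy = np.array([4500.0, 5500.0, 3500.0])

baseline_pred = np.full(y_test_toy.shape, y_train_toy.mean())
baseline_rmse = float(np.sqrt(np.mean((y_test_toy - baseline_pred) ** 2)))
print(round(baseline_rmse, 1))
```

Running the same comparison on the real y and y_test would show how much of the model's accuracy comes from actually learning the features rather than the overall level of demand.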
Part 4 - Reflection / comments¶
Tasks:
(Optional) Please share with us any free form reflection, comments or feedback you have in the context of this test task.
Part 4- Hyperparameter optimization of RF¶
Please do not run the following cell if you do not want to wait a few hours :))
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import GridSearchCV
## Using grid search
rf = RandomForestRegressor()
# The function to measure the quality of a split.
# Note: None is not a valid criterion in recent scikit-learn; invalid grid
# entries raise InvalidParameterError and GridSearchCV scores them as NaN.
criterion = [None, 'friedman_mse', 'poisson', 'squared_error', 'absolute_error']
# Number of trees in random forest
n_estimators = list(range(50,400,50))
# Number of features to consider at every split
max_features = [None, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 'auto', 'sqrt', 'log2']  # 'auto' was removed in scikit-learn 1.3 and triggers the FitFailedWarning below
# Maximum number of levels in tree
max_depth = list(range(3, 15))
max_depth.append(None)
# Minimum number of samples required to split a node
# min_samples_split = [2, 4, 6, 8, 10]
# Minimum number of samples required at each leaf node
# min_samples_leaf = [1, 2, 3, 4]
# Create the random grid
random_grid = {
'criterion': criterion,
'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
# 'min_samples_split': min_samples_split,
# 'min_samples_leaf': min_samples_leaf,
}
#The data is time-series so we need to use a time-series K-fold split
tscv = TimeSeriesSplit(n_splits=3)
tuning_model=GridSearchCV(rf, param_grid=random_grid, n_jobs=-1,
scoring='neg_root_mean_squared_error', cv=tscv, verbose=1, return_train_score=True)
tuning_model.fit(X, y)
# best hyperparameters
print('Best params:', tuning_model.best_params_)
Fitting 3 folds for each of 4368 candidates, totalling 13104 fits
C:\Program Files\Python310\lib\site-packages\sklearn\model_selection\_validation.py:547: FitFailedWarning:
1008 fits failed out of a total of 13104.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
634 fits failed with the following error:
sklearn.utils._param_validation.InvalidParameterError: The 'max_features' parameter of RandomForestRegressor must be an int in the range [1, inf), a float in the range (0.0, 1.0], a str among {'log2', 'sqrt'} or None. Got 'auto' instead.
--------------------------------------------------------------------------------
374 fits failed with the same InvalidParameterError.
C:\Program Files\Python310\lib\site-packages\sklearn\model_selection\_search.py:1051: UserWarning:
One or more of the test scores are non-finite: [-1668.23860228 -1636.20084943 -1644.36404681 ... -1388.85966757
-1368.65889195 -1374.33253737]
C:\Program Files\Python310\lib\site-packages\sklearn\model_selection\_search.py:1051: UserWarning:
One or more of the train scores are non-finite: [-666.46585393 -632.16682121 -644.91144761 ... -205.95360296 -206.22356903
-206.4015325 ]
Best params: {'criterion': 'absolute_error', 'max_depth': 12, 'max_features': 5, 'n_estimators': 100}
results = pd.DataFrame.from_dict(tuning_model.cv_results_)
train_score= results['mean_train_score']
train_score_std= results['std_train_score']
cv_score = results['mean_test_score']
cv_score_std= results['std_test_score']
param_max_depth = results['param_max_depth']
print(results.shape)
results.head()
(4368, 20)
| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_max_features | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.076591 | 0.002868 | 0.003336 | 4.717188e-03 | friedman_mse | 3 | 2 | 50 | {'criterion': 'friedman_mse', 'max_depth': 3, ... | -874.400826 | -2163.260733 | -1967.054247 | -1668.238602 | 567.014437 | 4030 | -470.017185 | -616.310180 | -913.070196 | -666.465854 | 184.319806 |
| 1 | 0.151746 | 0.007192 | 0.005208 | 7.364684e-03 | friedman_mse | 3 | 2 | 100 | {'criterion': 'friedman_mse', 'max_depth': 3, ... | -874.024867 | -2164.108577 | -1870.469105 | -1636.200849 | 552.111224 | 4016 | -430.341695 | -591.963150 | -874.195618 | -632.166821 | 183.419060 |
| 2 | 0.224064 | 0.007274 | 0.015623 | 1.123916e-07 | friedman_mse | 3 | 2 | 150 | {'criterion': 'friedman_mse', 'max_depth': 3, ... | -871.098234 | -2157.454362 | -1904.539545 | -1644.364047 | 556.444959 | 4024 | -441.387972 | -600.095792 | -893.250579 | -644.911448 | 187.174226 |
| 3 | 0.299290 | 0.013429 | 0.014777 | 1.196802e-03 | friedman_mse | 3 | 2 | 200 | {'criterion': 'friedman_mse', 'max_depth': 3, ... | -856.963866 | -2154.028673 | -1931.999919 | -1647.664152 | 566.409392 | 4026 | -446.831962 | -593.749770 | -881.394468 | -640.658733 | 180.483575 |
| 4 | 0.372235 | 0.013515 | 0.014155 | 8.067228e-03 | friedman_mse | 3 | 2 | 250 | {'criterion': 'friedman_mse', 'max_depth': 3, ... | -828.944064 | -2156.715865 | -1838.925587 | -1608.195172 | 566.081197 | 3988 | -442.465594 | -599.835763 | -873.659792 | -638.653716 | 178.161414 |
plt.rcParams["figure.figsize"] = [9, 6]
plt.rcParams["figure.autolayout"] = True
plt.plot(param_max_depth.astype(float), np.abs(train_score), label = 'Train')
plt.plot(param_max_depth.astype(float), np.abs(cv_score), label = 'Validation')
plt.title('Mean RMSE in Grid Search CV vs. max_depth (training and validation)')
plt.xlabel('max_depth')
plt.ylabel('Mean RMSE')
plt.legend()
plt.show()
Analyzing the Training Score vs. Max_Depth Plot
The plot shows one aspect of hyperparameter tuning with Grid Search CV. It visualizes the relationship between the average error and the `max_depth` hyperparameter for both the training and validation phases of the Random Forest model.
Key Observations:
- Training Error: The training error, represented by the blue line, generally decreases as `max_depth` increases. This is expected because deeper trees can capture more complex patterns in the training data, leading to better performance on the training set.
- Validation Error: The validation error, represented by the orange line, initially decreases with `max_depth` but eventually starts to increase. This indicates that while deeper trees perform well on the training data, they may be overfitting, leading to poorer generalization on unseen data.
- Overfitting: The gap between the training and validation errors widens as `max_depth` increases. This suggests that the model is becoming increasingly overfit, learning the training data too well and failing to generalize to new data.
- Optimal Max_Depth: The optimal `max_depth` is the point where the validation error is lowest. This is where the model generalizes best to unseen data without overfitting.
Part 4 - Fitting the best model¶
# The parameters of the model are set based on tuning_model.best_params_
# Best params: {'criterion': 'absolute_error', 'max_depth': 12, 'max_features': 5, 'n_estimators': 100}
Best_RF_model = RandomForestRegressor(criterion='absolute_error', max_depth= 12, max_features=5,
n_estimators= 100, n_jobs=-1)
Best_RF_model.fit(X, y)
RandomForestRegressor(criterion='absolute_error', max_depth=12, max_features=5,
                      n_jobs=-1)
Part 4 - Evaluating the best model¶
# prediction
predict = Best_RF_model.predict(X_test)
train_predict = Best_RF_model.predict(X)
### Metrics
print('RMSE_train:', root_mean_squared_error(y, train_predict))
print('RMSE_test:', root_mean_squared_error(y_test, predict))
print('MSE of test:', mean_squared_error(y_test, predict))
print('MAE of test:', mean_absolute_error(y_test, predict))
print("Test R^2:", r2_score(y_test, predict))
RMSE_train: 273.1367232023141 RMSE_test: 1037.2987758645047 MSE of test: 1075988.75041 MAE of test: 809.3443333333332 Test R^2: 0.6639645243066773
The model metrics did not change much.
This means either:
- Random Forest cannot capture seasonality, reduce variance errors, or capture time-related patterns, which makes it an unsuitable model for this problem, or
- the hyperparameter tuning was not deep and granular enough.

Suggestion:
Apply models that are designed for time-series data:
- LSTM, RNN, or GRU models. In this case, collecting and providing more data is necessary to fit the DNNs.
- Models like ARIMA and FBProphet, which can cope with small datasets.

Applying more advanced tuning to the RF model could also be tried, but it is not recommended, since we already see overfitting (`RMSE_train = 273` and `RMSE_test = 1037`).

Overall outcome:
Considering all the circumstances, applying ARIMA, FBProphet, or a similar time-series method is the most promising direction.
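Before reaching for ARIMA or FBProphet, a seasonal-naive baseline makes the time-series argument concrete: repeat the last observed training week across the 30-day horizon and measure RMSE. A minimal sketch on synthetic daily data (an assumed stand-in for `df_all['cnt']`, not the actual dataset):

```python
import numpy as np
import pandas as pd

# Synthetic daily series with weekly seasonality, standing in for df_all['cnt']
rng = np.random.default_rng(42)
n_days = 120
cnt = pd.Series(
    2000
    + 500 * np.sin(np.arange(n_days) * 2 * np.pi / 7)  # weekly cycle
    + rng.normal(0, 50, n_days)                        # noise
)

# Same split idea as Part 1: hold out the last 30 days
train, test = cnt.iloc[:-30], cnt.iloc[-30:]

# Seasonal-naive forecast: tile the last observed training week over the horizon
last_week = train.iloc[-7:].to_numpy()
preds = np.resize(last_week, len(test))  # np.resize repeats the array cyclically

rmse = float(np.sqrt(np.mean((test.to_numpy() - preds) ** 2)))
```

If even this baseline beats the tuned RF on the held-out month, that is strong evidence the problem needs a time-series treatment.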
# Visualizing y_test vs. prediction
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [15, 8]
plt.rcParams["figure.autolayout"] = True
x = list(range(len(y_test)))
y1 = predict
y2 = y_test
# plt.legend()
plt.subplot(211)
plt.title('Prediction Vs. Actual value')
plt.xlabel('index')
plt.ylabel('Number of rides')
plt.plot(x, y1, color='red', lw=5, label='Ride prediction')
plt.plot(x, y2, color='Blue', lw=3, label='Actual number of rides')
plt.legend()
<matplotlib.legend.Legend at 0x164e5d3d390>
Analysis of the Ride Prediction Chart
The chart compares predicted daily ride counts to the actual ones. The general trend indicates that the prediction model is moderately accurate, with the predicted values following the overall shape of the actual values.
Specific Observations:
- Correlation: There's a clear correlation between the predicted and actual values, suggesting that the model is capturing some underlying patterns in the data.
- Underestimation and Overestimation: In certain periods, the model underestimates the actual number of rides, while in others it overestimates. This indicates that the model's accuracy varies over time.
- Lag: There might be a slight lag between the predicted and actual values. This could be due to factors like the model's training data or the inherent time delay in the system.
- Outliers: A few data points deviate significantly from the overall trend. These could be due to unusual events or anomalies in the data.
Error = pd.DataFrame(list(predict - y_test), columns=['Ride Error distribution'])
g = sns.displot(Error, kde=True, kind="hist", bins=20)
g.set_titles("Error distribution")
Error
| Ride Error distribution | |
|---|---|
| 0 | -1595.715 |
| 1 | 52.495 |
| 2 | 35.360 |
| 3 | 659.840 |
| 4 | -1507.250 |
| 5 | -301.305 |
| 6 | -2246.790 |
| 7 | 217.915 |
| 8 | -999.320 |
| 9 | -412.075 |
| 10 | -145.020 |
| 11 | -314.990 |
| 12 | -442.040 |
| 13 | 258.100 |
| 14 | -607.625 |
| 15 | -937.865 |
| 16 | 288.815 |
| 17 | 36.080 |
| 18 | 1051.495 |
| 19 | 344.395 |
| 20 | 1340.040 |
| 21 | 1455.160 |
| 22 | 2156.880 |
| 23 | 2626.990 |
| 24 | 2131.260 |
| 25 | 992.680 |
| 26 | 370.870 |
| 27 | 1707.290 |
| 28 | 1088.540 |
| 29 | 491.890 |
# `distplot` is deprecated in recent seaborn; `histplot` with stat="density" is the modern equivalent
sns.histplot(Error, kde=True, bins=20, stat="density")
<Axes: ylabel='Density'>
Analyzing the two Ride Error Histograms
The first histogram presents the distribution of ride errors (predicted minus actual ride counts): the x-axis shows the error bins in number of rides and the y-axis the count of days per bin. The second is the same histogram with the y-axis showing the density of each bin instead.
- Distribution Shape: The histogram indicates a roughly normal distribution of ride errors, with a central peak and symmetrical tails. This suggests that the model's predictions are generally accurate, with errors distributed relatively evenly around the mean.
- Skewness: Counting the signs in the error table, 19 of the 30 errors are positive (overestimations) versus 11 negative (underestimations), so the distribution leans toward overestimation.
- Outliers: A few data points deviate strongly on both tails (around -2247 and +2627 rides). These represent days where the model badly mis-estimated demand, possibly due to unusual events or anomalies in the data.
- Central Tendency: The majority of errors are clustered around the 0 mark, indicating that the model's predictions are generally accurate.
- Spread: The distribution is relatively spread out, suggesting that there is some variability in the model's errors.
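The skew and outlier claims above can be double-checked numerically from the 30 error values in the table (predicted minus actual):

```python
import numpy as np

# The 30 ride errors copied from the table above (predicted minus actual)
errors = np.array([
    -1595.715, 52.495, 35.360, 659.840, -1507.250, -301.305,
    -2246.790, 217.915, -999.320, -412.075, -145.020, -314.990,
    -442.040, 258.100, -607.625, -937.865, 288.815, 36.080,
    1051.495, 344.395, 1340.040, 1455.160, 2156.880, 2626.990,
    2131.260, 992.680, 370.870, 1707.290, 1088.540, 491.890,
])

mean_error = errors.mean()            # average bias of the predictions
n_over = int((errors > 0).sum())      # overestimations
n_under = int((errors < 0).sum())     # underestimations
```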
Permutation Test for Model Significance¶
Conduct a permutation test to assess the statistical significance of the tuned RF model.
The p-value is the fraction of permuted refits whose R² is at least as high as the actual model's R².
from sklearn.utils import shuffle
# Step 1: Train the actual model and calculate % variance explained (R-squared)
Best_RF_model.fit(X, y)
y_pred = Best_RF_model.predict(X_test)
actual_r2 = r2_score(y_test, y_pred)
# Step 2: Permute the dependent variable and calculate % variance explained repeatedly
n_permutations = 1000  # increasing this to 10,000 or more gives a more precise p-value
permuted_r2_scores = []
from sklearn.base import clone  # fit clones so Best_RF_model keeps its fit on the real labels

for _ in range(n_permutations):
    # Permute the dependent variable
    y_train_permuted = np.random.permutation(y)
    # Fit a fresh clone on the permuted data; refitting Best_RF_model itself
    # would leave it fitted on permuted labels, corrupting the
    # feature-importance section below
    perm_model = clone(Best_RF_model)
    perm_model.fit(X, y_train_permuted)
    # Predict and calculate the R-squared
    y_pred_permuted = perm_model.predict(X_test)
    permuted_r2 = r2_score(y_test, y_pred_permuted)
    permuted_r2_scores.append(permuted_r2)
# Step 3: Calculate the p-value
permuted_r2_scores = np.array(permuted_r2_scores)
p_value = np.sum(permuted_r2_scores >= actual_r2) / n_permutations
# Output the results
print(f"Actual R-squared: {actual_r2:.4f}")
print(f"P-value: {p_value:.4f}")
Actual R-squared: 0.6684 P-value: 0.0000
The p-value is almost zero.
This indicates that the model is statistically significant: the mapping the model has learned (e.g. its tree structure) between the features and the target is not random or arbitrary. With p < 0.01, the result is significant at well beyond the $99\%$ confidence level.
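One refinement worth noting (standard practice, not in the cell above): permutation p-values are often reported with an add-one correction, p = (1 + #{permuted ≥ actual}) / (1 + n), so that a finite number of permutations never yields exactly zero. A minimal sketch with stand-in scores (a real run would use `permuted_r2_scores` and `actual_r2` from the cell above):

```python
import numpy as np

# Stand-in permuted R^2 scores: noise centred on zero, as expected under the null
rng = np.random.default_rng(0)
permuted = rng.normal(0.0, 0.05, 1000)

actual_r2 = 0.6684  # actual R^2 reported above

# Add-one (smoothed) permutation p-value: never exactly zero
p_smoothed = (1 + np.sum(permuted >= actual_r2)) / (1 + len(permuted))
```

With no permuted score reaching the actual R², this reports p ≈ 1/1001 ≈ 0.001 rather than 0.0000.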
Feature importance¶
importances = Best_RF_model.feature_importances_
indices = np.argsort(importances)[::-1] # Sort the feature importances in descending order
# Names of the features
features = X.columns if hasattr(X, 'columns') else [f'Feature {i}' for i in range(X.shape[1])]
# Create the plot
# Bar Plot of Feature Importances
plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [features[i] for i in indices], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.show()
Analyzing the Feature Importance Bar Chart
- Most Important Feature: The `yr` (year) feature is the most important according to the chart, contributing significantly to the model's predictions.
- Temperature-Related Features: Features related to temperature ("temp" and "atemp") also appear to be quite important, suggesting that temperature plays a significant role in the number of rides.
- Humidity and Month: Humidity and month also have a moderate influence on the model's predictions.
- Seasonal and Weather-Related Features: Features like "season," "windspeed," "weathersit," and "weekday" have lower importance scores, suggesting that they are less influential or have more complex relationships with the target variable.
- Working Day and Holiday: The features "workingday" and "holiday" have the lowest importance scores, indicating minimal impact on the model's predictions. They are good candidates for elimination in a feature selection phase.
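As a cross-check on the impurity-based importances above (which can be biased toward high-cardinality features), sklearn's `permutation_importance` measures the score drop when each feature is shuffled. A sketch on toy data where, by construction, only the first feature drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Toy data: only feature 0 drives the target (assumption: stands in for X, y)
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(0, 0.1, 200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Shuffle each column 10 times and record the mean R^2 drop
result = permutation_importance(model, X_demo, y_demo, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

On the notebook's model the equivalent call would be `permutation_importance(Best_RF_model, X_test, y_test, ...)`; agreement between the two rankings strengthens the feature-elimination conclusions above.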
Answers / comments / reasoning:
Submission¶
Please submit this notebook with your developments in .ipynb and .html formats as well as your requirements.txt file.
References¶
[1] Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.